Abstract: We establish a new concentration result for regularized risk minimizers
which is similar to an oracle inequality. Applying this inequality to regularized
least squares minimizers like least squares support vector machines, we show that
these algorithms learn with (almost) the optimal rate in some specific situations.
In addition, for regression our results suggest that using the loss function
$L_\alpha(y,t) = |y - t|^\alpha$ with $\alpha$ near 1 may often be preferable to the
usual choice of $\alpha = 2$.

1. Introduction 

The theoretical understanding of support vector machines (SVMs) and related 
kernel-based methods has been substantially improved in recent years. Based on 
Talagrand's concentration inequality and local Rademacher averages it has recently 
been shown that SVMs for classification can learn with rates up to $n^{-1}$ under
somewhat realistic assumptions on the data-generating distribution (see [12] and the
related work 0]). However, the currently available technique, namely the so-called
"shrinking technique" in [12], for establishing such rates requires choosing the
entire regularization sequence a priori. Unfortunately, the optimal regularization
sequences usually depend on some features of the data-generating distribution that
are typically unknown in practice, and consequently the results derived by the
shrinking technique have some serious drawbacks.

In this work we replace the shrinking technique by a localization argument similar 
to the localization argument used in conjunction with local Rademacher averages. 
The key observation for this new localization argument is that regularized risk 
minimizers control the size of the norm in the regularization term by their (excess) 
risk in a non-trivial manner (see Lemma 4.1 for details). As a consequence of this
observation, we can not only localize with respect to small variances but also with 
respect to small maximum norms. 

Using the above (double) localization we obtain oracle-type inequalities for a
large class of regularized risk minimizers including support vector machines and
regularization networks. For the former we can easily reproduce the rates established
in [12, 13], while for the latter we show some minimax rates in specific situations
and provide results indicating that using the loss function
$L_\alpha(y,t) = |y - t|^\alpha$ with $\alpha$ near 1 to estimate the regression
function may be more robust to both outliers and the choice of regularization
parameter than the usual choice $\alpha = 2$.



2. An oracle inequality for regularized risk minimizers 

Throughout this work we assume that $X$ is a compact metric space, $Y \subset [-1,1]$
is compact, $P$ is a Borel probability measure on $X \times Y$, and $H$ is a RKHS of
continuous functions over $X$ with closed unit ball $B_H$. It is well-known that $H$
can then be continuously embedded into the space of continuous functions $C(X)$
equipped with the usual maximum-norm $\|\cdot\|_\infty$. In order to avoid constants we
always assume that this embedding has norm 1, i.e. $\|\cdot\|_\infty \le \|\cdot\|_H$.

Furthermore, $L : Y \times \mathbb{R} \to [0,\infty)$ always denotes a continuous
function which is convex in the second variable. In the following we are particularly
interested in functions $L$ that satisfy the growth assumptions introduced in Q:

(1) $\sup_{y \in Y} L(y,t) \le 1 + |t|^\alpha$ and $\sup_{y \in Y} \bigl| L_{|Y \times [-t,t]}(y,\cdot) \bigr|_1 \le c_L t^{\alpha-1}$

for some constants $\alpha \in [1,2]$, $c_L > 0$, and all $t \in \mathbb{R}$, where
$|h|_1$ denotes the Lipschitz constant of a function $h$. The functions $L$ will serve
as loss functions and consequently let us recall the associated $L$-risk

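$\mathcal{R}_{L,P}(f) := \mathbb{E}_{(x,y) \sim P} L(y, f(x))$,

where $f : X \to \mathbb{R}$ is a measurable function. Note that (1) immediately gives
$\mathcal{R}_{L,P}(0) \le 1$. Furthermore, the minimal $L$-risk is denoted by
$\mathcal{R}^*_{L,P}$, i.e.

$\mathcal{R}^*_{L,P} = \inf\{ \mathcal{R}_{L,P}(f) \mid f : X \to \mathbb{R} \text{ measurable} \}$,

and a function attaining this infimum is denoted by $f^*_{L,P}$.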

The learning schemes we are interested in are based on an optimization problem 
of the form 

$f_{P,\lambda} := \arg\min_{f \in H} \bigl( \lambda \|f\|_H^2 + \mathcal{R}_{L,P}(f) \bigr)$,

where $\lambda > 0$. Note that if we identify a training set
$T = ((x_1,y_1),\dots,(x_n,y_n)) \in (X \times Y)^n$ with its empirical measure, then
$f_{T,\lambda}$ denotes the empirical estimators of the above learning scheme.
Obviously, support vector machines and regularization networks are both learning
algorithms which fall into the above category.
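For the least squares loss the empirical minimizer $f_{T,\lambda}$ has a closed form, which makes the learning scheme easy to try out. The following is a minimal sketch, not the authors' implementation: it assumes a Gaussian kernel (chosen here only because it is bounded by 1), and the helper names `gaussian_kernel`, `fit_regularized_least_squares` and `objective` are ours. By the representer theorem the minimizer is $f = \sum_i c_i k(x_i,\cdot)$ with $c = (K + n\lambda I)^{-1} y$.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2); an assumed kernel choice, bounded by 1.
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_regularized_least_squares(X, y, lam, gamma=1.0):
    # Minimizes lam * ||f||_H^2 + (1/n) * sum_i (y_i - f(x_i))^2 over the RKHS.
    # Setting the gradient to zero gives c = (K + n * lam * I)^{-1} y.
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def objective(X, y, coef, lam, gamma=1.0):
    # Regularized empirical risk of f = sum_i coef_i k(x_i, .).
    K = gaussian_kernel(X, X, gamma)
    return lam * coef @ K @ coef + np.mean((y - K @ coef) ** 2)
```

Comparing the objective at the minimizer with its value at $f = 0$ gives $\lambda \|f_{T,\lambda}\|_H^2 \le 1$ whenever $|y_i| \le 1$, which is the a priori norm bound used later in the text.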

One way to describe the approximation error of these learning schemes is the 
approximation error function 

$a(\lambda) := \lambda \|f_{P,\lambda}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P}$, $\lambda > 0$,

which we discussed in some detail in [13]. Furthermore, in order to deal with the
complexity of the used RKHSs let us recall that for a subset $A \subset E$ of a Banach
space $E$ the covering numbers are defined by

$\mathcal{N}(A,\varepsilon,E) := \min\bigl\{ n \ge 1 : \exists x_1,\dots,x_n \in E \text{ with } A \subset \bigcup_{i=1}^n (x_i + \varepsilon B_E) \bigr\}$, $\varepsilon > 0$,

where $B_E$ denotes the closed unit ball of $E$. Given a finite sequence
$T = (z_1,\dots,z_n) \in Z^n$ we are particularly interested in the Banach space
$L_2(T)$ which consists of all equivalence classes of functions $f : Z \to \mathbb{R}$
and which is equipped with the norm

(2) $\|f\|_{L_2(T)} := \Bigl( \frac{1}{n} \sum_{i=1}^n |f(z_i)|^2 \Bigr)^{1/2}$.

In other words, $L_2(T)$ is an $L_2$-space with respect to the empirical measure of
$(z_1,\dots,z_n)$. Furthermore, if $T$ is of the form $T = ((x_1,y_1),\dots,(x_n,y_n))$,
and $T_X := (x_1,\dots,x_n)$, then the space $L_2(T_X)$ has the obvious meaning. In
addition to the convention $0^0 := 1$ we utilize the following

(3) $a^{1/0} := \begin{cases} 0 & \text{if } 0 \le a < 1, \\ 1 & \text{if } a = 1, \\ \infty & \text{if } a > 1. \end{cases}$
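This convention matters later, when exponents of the form $2/(2-\alpha)$ are evaluated at $\alpha = 2$. A minimal sketch of how one might encode it, assuming the exponent is passed as a numerator and denominator pair; the helper name `power_with_convention` is ours:

```python
import math

def power_with_convention(base, num, den):
    """Return base**(num/den) for base >= 0 and num > 0.

    A zero denominator is read through the convention above:
    0 if base < 1, 1 if base == 1, infinity if base > 1.
    """
    if base < 0:
        raise ValueError("base must be non-negative")
    if den != 0:
        return base ** (num / den)
    if base < 1:
        return 0.0
    if base == 1:
        return 1.0
    return math.inf
```

With this reading, a term such as $(Kx/(\lambda n))^{2/(2-\alpha)}$ vanishes at $\alpha = 2$ as soon as $Kx/(\lambda n) < 1$.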

Now we can state the main result of this paper: 

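Theorem 2.1. Let $H$ be a RKHS of a continuous kernel over $X$ with
$\|\cdot\|_\infty \le \|\cdot\|_H$. Assume that there are constants $a \ge 1$ and
$0 < p < 2$ such that for all $\delta > 0$ we have

(4) $\sup_{T \in Z^n} \log \mathcal{N}(B_H, \delta, L_2(T_X)) \le a \delta^{-p}$.

Let $L : Y \times \mathbb{R} \to [0,\infty)$ be a continuous function which is convex in
its second variable and satisfies (1). Furthermore, let $P$ be a distribution on
$X \times Y$ such that $f^*_{L,P}$ exists. Moreover, suppose that for all
$0 < \lambda < 1$ and all $f \in \lambda^{-1/2} B_H$ we have

(5) $\mathbb{E}_P (L \circ f - L \circ f^*_{L,P})^2 \le c \, (\|f\|_\infty + 1)^v \bigl( \mathbb{E}_P (L \circ f - L \circ f^*_{L,P}) \bigr)^\vartheta$

for some constants $c \ge 1$, $\vartheta \in (0,1]$, and $v \in [0,2]$. Then there exists a
constant $K \ge 1$ such that for all $0 < \lambda < 1$, $\varepsilon > 0$, $x \ge 1$
satisfying

$\varepsilon \ge \max\Bigl\{ a(\lambda) + \lambda,\ \Bigl( \frac{K a}{\lambda^{\frac{2\alpha p + v(2-p)}{4}} n} \Bigr)^{\frac{4}{8 - 2\alpha p - (v + 2\vartheta)(2-p)}},\ \Bigl( \frac{K a}{\lambda^{\frac{\alpha(2+p)}{4}} n} \Bigr)^{\frac{4}{(2+p)(2-\alpha)}},\ \Bigl( \frac{K x}{\lambda^{\frac{v}{2}} n} \Bigr)^{\frac{2}{4 - (v + 2\vartheta)}},\ \Bigl( \frac{K x}{\lambda^{\frac{\alpha}{2}} n} \Bigr)^{\frac{2}{2-\alpha}} \Bigr\}$

we have

$\Pr^*\bigl( T \in Z^n : \mathcal{R}_{L,P}(f_{T,\lambda}) - \mathcal{R}^*_{L,P} \le a(\lambda) + \varepsilon \bigr) \ge 1 - e^{-x}$,

where $\Pr^*$ denotes the outer probability.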

Theorem 2.1 is proved in Section 4. Now we proceed to illustrate its utility with
some applications.

Example 2.2 (Least squares regression with Sobolev spaces). Let us consider the
least squares loss function which is defined by $L(y,t) = (y - t)^2$. Furthermore,
let us assume that $H$ contains the regression function $x \mapsto \mathbb{E}(y|x)$ and
satisfies the complexity exponent condition (4). In addition let $(\lambda_n)$ be a
strictly positive null-sequence with $\lambda_n^{1+p/2} n \to \infty$. Then in Section 5
we show that our learning rate is of the form $\lambda_n$. In particular, if $H$ is a
Sobolev space of order $m$ on some suitable $X \subset \mathbb{R}^d$, $m > d/2$, then we
have $p = d/m$, and consequently for $\lambda_n := n^{-\frac{2m}{2m+d}} \log n$ our rate
becomes $n^{-\frac{2m}{2m+d}} \log n$. This equals the optimal rate
$n^{-\frac{2m}{2m+d}}$ up to a logarithmic factor (see e.g. 0] and the references therein).
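The arithmetic behind this example is short enough to check numerically. The sketch below is only an illustration; the function name `sobolev_rate` is ours. It computes $p = d/m$, the sequence $\lambda_n = n^{-2m/(2m+d)} \log n$, and the quantity $\lambda_n^{1+p/2} n$, which simplifies to $(\log n)^{1+p/2}$ and therefore tends to infinity as required.

```python
import math

def sobolev_rate(n, m, d):
    """Return (lambda_n, rate_exponent, growth) for a Sobolev space of order m on R^d.

    growth is lambda_n**(1 + p/2) * n with p = d/m; it should tend to infinity.
    """
    if not m > d / 2:
        raise ValueError("the example requires m > d/2")
    p = d / m
    rate_exponent = 2 * m / (2 * m + d)
    lam = n ** (-rate_exponent) * math.log(n)
    growth = lam ** (1 + p / 2) * n
    return lam, rate_exponent, growth
```

For instance, with $m = 2$ and $d = 1$ the rate exponent is $4/5$, so the bound reads $n^{-4/5} \log n$.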

Example 2.3 (Comparison of different loss functions used for regression). Consider
again regression with the squared loss function $L_2(y,t) = (y - t)^2$ defining
performance but use the loss function $L_\alpha(y,t) = |y - t|^\alpha$ with
$1 < \alpha \le 2$ to determine the estimate $f_{T,\lambda}$. Suppose that $H$ contains
the regression function $x \mapsto \mathbb{E}(y|x)$, and satisfies the complexity
exponent condition (4). In Section 5 we begin by using the oracle inequality of
Theorem 2.1 to bound the excess $L_\alpha$-risk
$\mathcal{R}_{L_\alpha,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_\alpha,P}$. When $\alpha = 2$
we reproduce the results of Example 2.2. When $1 < \alpha < 2$ we set
$\lambda = n^{-\kappa}$ with $\kappa > 0$ and observe that when
$\kappa \le \frac{2}{2+p}$ we obtain the rate $n^{-\kappa}$ independently of the value
of $\alpha$ and when $\kappa > \frac{2}{2+p}$ we obtain the rate
$n^{-\frac{2}{2+p} + (\kappa - \frac{2}{2+p}) \frac{\alpha}{2-\alpha}}$. We conclude that
the $\kappa$-optimal learning rate for the $L_\alpha$ risk is $n^{-\frac{2}{2+p}}$ and is
achieved when $\kappa = \frac{2}{2+p}$. Now suppose that the conditional distributions
$P(y|x)$ are symmetric. These results are then combined with a calibration inequality

$\mathcal{R}_{L_2,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_2,P} \le g\bigl( \mathcal{R}_{L_\alpha,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_\alpha,P} \bigr)$

derived from [11] to obtain bounds on
$\mathcal{R}_{L_2,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_2,P}$ for $1 < \alpha < 2$.
We observe that when $\kappa \le \frac{2}{2+p}$ we obtain the rate $n^{-\kappa}$
independently of the value of $\alpha$ and when $\kappa > \frac{2}{2+p}$ we obtain the
rate $n^{-\frac{2}{2+p} + (\kappa - \frac{2}{2+p}) \frac{2}{2-\alpha}}$. We conclude that
the $\kappa$-optimal learning rate for the $L_2$ risk also is $n^{-\frac{2}{2+p}}$ and is
achieved when $\kappa = \frac{2}{2+p}$. It is important to observe that the rate for
fixed $\kappa$ gets worse as $\alpha$ increases towards 2 and in particular that we have
no rates when $2 - (\kappa - \frac{2}{2+p})(2 + p) < \alpha < 2$. When $\alpha = 1$,
[11, Example 3.25] shows how, even though the loss function is not strictly convex, we
can obtain a calibration inequality in terms of assumptions concerning the
concentration about the mean. Consequently, with extra assumptions regarding
concentration about the mean we can apply these methods, but we do not carry out such
calculations here since they are out of the scope of this paper. Moreover, since
$\alpha = 1$ is considered more robust to outliers than $\alpha = 2$, these results
suggest that setting $\alpha$ near 1 has some substantial advantages over the usual
choice $\alpha = 2$. However, to make such a claim more precise will require considering
whether and in which sense the assumptions of symmetry and boundedness have been
violated. Finally, let us now consider when $H$ is a Sobolev space as in Example 2.2.
Then it is clear that we obtain the same optimal rates for all values of
$1 < \alpha < 2$, although for $\alpha$ near 1 we should concern ourselves with the
arising constants.

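Example 2.4 (Hinge loss classification). Let $Y := \{-1,1\}$, $L$ be defined by
$L(y,t) := \max\{0, 1 - yt\}$, $y \in Y$, $t \in \mathbb{R}$, and $P$ be a distribution
with Tsybakov noise exponent $q \in [0,\infty]$ in the sense of [H, [13] (see also 0).
When $q > 0$, it follows from [13, Lemma 6.6] that the assumption (5) is satisfied with
$\alpha = 1$, $v = \frac{q+2}{q+1}$, $\vartheta = \frac{q}{q+1}$, and
$c = \|(2\eta - 1)^{-1}\|_{q,\infty} + 2$. Moreover it is simple to show the same is
true when $q = 0$ but with $c = 5$. Hence the condition on $\varepsilon$ becomes

$\varepsilon \ge \max\Bigl\{ a(\lambda) + \lambda,\ \frac{1}{\lambda} \Bigl( \frac{K a}{n} \Bigr)^{\frac{4(q+1)}{2q + pq + 4}},\ \frac{1}{\lambda} \Bigl( \frac{K a}{n} \Bigr)^{\frac{4}{2+p}},\ \frac{1}{\lambda} \Bigl( \frac{K x}{n} \Bigr)^{\frac{2(q+1)}{q+2}},\ \frac{1}{\lambda} \Bigl( \frac{K x}{n} \Bigr)^{2} \Bigr\}$.

Some easy estimates then show that this reduces to

$\varepsilon \ge a(\lambda) + \lambda + K x^2 \lambda^{-1} \Bigl( \frac{a}{n} \Bigr)^{\frac{4(q+1)}{2q + pq + 4}}$,

where $K \ge 1$ is a suitable constant and $a$ and $n$ are assumed to satisfy
$n \ge a \ge 1$. From this we immediately obtain the rates established in
[12, Thm. 2.8] and [13, Thm. 1].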
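The reduction in Example 2.4 comes down to comparing four exponents. The following sketch uses exact rational arithmetic to evaluate them. It assumes the exponent formulas as we read them from the condition of Theorem 2.1 (with $\alpha < 2$ and $v + 2\vartheta < 4$ so that no denominator vanishes); the helper names `theorem_terms` and `hinge_terms` are ours, not the paper's.

```python
from fractions import Fraction

def theorem_terms(alpha, v, theta, p):
    """Return [(exponent, lambda_power), ...] for the four n-dependent terms.

    Each term is read as lambda**(-lambda_power) * (K * a_or_x / n)**exponent.
    """
    alpha, v, theta, p = map(Fraction, (alpha, v, theta, p))
    e1 = 4 / (8 - 2 * alpha * p - (v + 2 * theta) * (2 - p))
    e2 = 4 / ((2 + p) * (2 - alpha))
    e3 = 2 / (4 - (v + 2 * theta))
    e4 = 2 / (2 - alpha)
    l1 = (2 * alpha * p + v * (2 - p)) / 4
    l2 = alpha * (2 + p) / 4
    l3 = v / 2
    l4 = alpha / 2
    return [(e1, l1 * e1), (e2, l2 * e2), (e3, l3 * e3), (e4, l4 * e4)]

def hinge_terms(q, p):
    # Hinge loss case: alpha = 1, v = (q+2)/(q+1), theta = q/(q+1).
    q = Fraction(q)
    return theorem_terms(1, (q + 2) / (q + 1), q / (q + 1), p)
```

In the hinge case every term carries the same factor $\lambda^{-1}$, and the smallest exponent is $4(q+1)/(2q+pq+4)$, which is why that term dominates when $n \ge a \ge 1$.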



3. A concentration result for ERM schemes 



The proof of our main result Theorem 2.1 is based on a refinement of standard local
Rademacher average techniques. Since this refinement may be of its own interest
we separate its presentation from the proof of Theorem 2.1.

Let us begin by introducing some notations. To this end let $\mathcal{F}$ be a class of
bounded measurable functions from $Z$ to $\mathbb{R}$. In order to avoid measurability
considerations we always assume that $\mathcal{F}$ is separable with respect to
$\|\cdot\|_\infty$. Given a probability measure $P$ on $Z$ we define the modulus of
continuity of $\mathcal{F}$ by

$\omega_{P,n}(\mathcal{F},\varepsilon) := \mathbb{E}_{T \sim P^n} \Bigl( \sup_{f \in \mathcal{F},\ \mathbb{E}_P f \le \varepsilon} |\mathbb{E}_P f - \mathbb{E}_T f| \Bigr)$,

where we emphasize that the supremum is, as a function from $Z^n$ to $\mathbb{R}$,
measurable by the separability assumption on $\mathcal{F}$. In addition note that the
supremum is taken over all $f \in \mathcal{F}$ with $\mathbb{E}_P f \le \varepsilon$,
whereas usually the supremum is taken over all $f \in \mathcal{F}$ with
$\mathbb{E}_P f^2 \le \varepsilon$.

We also need some notations related to ERM-type algorithms: we call
$C : \mathcal{F} \times Z \to [0,\infty)$ a cost function if $C \circ f := C(f,\cdot)$ is
measurable for all $f \in \mathcal{F}$. Given a probability measure $P$ on $Z$ we denote
by $f_{P,\mathcal{F}} \in \mathcal{F}$ a minimizer of

$f \mapsto \mathcal{R}_{C,P}(f) := \mathbb{E}_{z \sim P} C(f,z)$.

Moreover, if $P$ is an empirical measure with respect to $T \in Z^n$ we write
$f_{T,\mathcal{F}}$ and $\mathcal{R}_{C,T}(\cdot)$ as usual. For simplicity, we assume
throughout this section that $f_{P,\mathcal{F}}$ and $f_{T,\mathcal{F}}$ do exist.
Furthermore, although there may be multiple solutions we use a single symbol for them
whenever no confusion regarding the non-uniqueness of this symbol can be expected. An
algorithm that produces solutions $f_{T,\mathcal{F}}$ is called an empirical $C$-risk
minimizer. Moreover, if $\mathcal{F}$ is convex, we say that $C$ is convex if
$C(\cdot,z)$ is convex for all $z \in Z$. Finally, $C$ is called line-continuous if for
all $z \in Z$ and all $f, f' \in \mathcal{F}$ the function
$t \mapsto C(tf + (1-t)f', z)$ is continuous on $[0,1]$. If $\mathcal{F}$ is a vector
space then every convex $C$ is line-continuous. Now we can formulate the main result of
this section:

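Theorem 3.1. Let $\mathcal{F}$ be a convex set of bounded measurable functions from $Z$
to $\mathbb{R}$, $C : \mathcal{F} \times Z \to [0,\infty)$ be a convex, line-continuous
cost function, and $P$ be a probability measure on $Z$. Assume that

$\mathcal{G} := \{ C \circ f - C \circ f_{P,\mathcal{F}} : f \in \mathcal{F} \}$

is separable with respect to $\|\cdot\|_\infty$. Furthermore assume that there exist
constants $b, B \ge 0$, $\beta \in [0,1]$, and $w, W \ge 0$, $v \in [0,2]$,
$\vartheta \in [0,2)$, such that

(6) $\|g\|_\infty \le b (\mathbb{E}_P g)^\beta + B$

and

(7) $\mathbb{E}_P g^2 \le \bigl( b (\mathbb{E}_P g)^\beta + B \bigr)^v \bigl( w (\mathbb{E}_P g)^\vartheta + W \bigr)$

for all $g \in \mathcal{G}$. Then for $n \ge 1$, $x \ge 1$ and $\varepsilon > 0$
satisfying

$\varepsilon \ge 3 \omega_{P,n}(\mathcal{G},\varepsilon) + \sqrt{\frac{2x (b\varepsilon^\beta + B)^v (w\varepsilon^\vartheta + W)}{n}} + \frac{2x (b\varepsilon^\beta + B)}{n}$

we have

$\Pr^*\bigl( T \in Z^n : \mathcal{R}_{C,P}(f_{T,\mathcal{F}}) < \mathcal{R}_{C,P}(f_{P,\mathcal{F}}) + \varepsilon \bigr) \ge 1 - e^{-x}$.
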
In order to prove Theorem 3.1 let us first recall Talagrand's concentration
inequality (see [14]). The following version of this inequality is derived from
Bousquet's result in [4] using a little trick presented in [1, Lem. 2.5]:

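Theorem 3.2. Let $P$ be a probability measure on $Z$ and $\mathcal{H}$ be a set of
bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to
$\|\cdot\|_\infty$ and satisfies $\mathbb{E}_P h = 0$ for all $h \in \mathcal{H}$.
Furthermore, let $M > 0$ and $\tau > 0$ be constants with $\|h\|_\infty \le M$ and
$\mathbb{E}_P h^2 \le \tau$ for all $h \in \mathcal{H}$. Then for all $x \ge 1$ and all
$n \ge 1$ we have

$P^n\Bigl( T \in Z^n : \sup_{h \in \mathcal{H}} \mathbb{E}_T h > 3 \mathbb{E}_{T' \sim P^n} \sup_{h \in \mathcal{H}} \mathbb{E}_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{Mx}{n} \Bigr) \le e^{-x}$.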
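To get a feeling for the two deviation terms in Theorem 3.2, one can simply evaluate them. The sketch below is illustrative only; the function name `deviation_terms` is ours, and the parameter names follow the theorem's symbols $x$, $\tau$, $M$, $n$.

```python
import math

def deviation_terms(x, tau, M, n):
    """Return sqrt(2*x*tau/n) + M*x/n, the deviation part of the bound in Theorem 3.2."""
    if n < 1 or x < 1 or tau < 0 or M < 0:
        raise ValueError("expects n >= 1, x >= 1, tau >= 0, M >= 0")
    return math.sqrt(2.0 * x * tau / n) + M * x / n
```

The first term decays like $n^{-1/2}$ and is governed by the variance bound $\tau$, while the second decays like $n^{-1}$ and is governed by the supremum bound $M$. Localization pays off precisely because small $\tau$ makes the slower term small.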

This concentration inequality is used to prove the following lemma which is a 
generalized version of Lemma 13 in Q and Lemma 5.4 in [I3 |: 

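Lemma 3.3. Let $P$ be a probability measure on $Z$ and $\mathcal{G}$ be a set of
bounded measurable functions from $Z$ to $\mathbb{R}$ which is separable with respect to
$\|\cdot\|_\infty$. Let us assume that $\mathcal{G}$ satisfies (6) and (7) and that
there is a constant $\alpha \in [0,1)$ such that for all $T \in Z^n$, $\varepsilon > 0$
for which there is a $g \in \mathcal{G}$ with

$\mathbb{E}_T g \le \alpha\varepsilon$ and $\mathbb{E}_P g \ge \varepsilon$

there is also an element $g^* \in \mathcal{G}$ with

$\mathbb{E}_T g^* \le \alpha\varepsilon$ and $\mathbb{E}_P g^* = \varepsilon$.

Then for all $n \ge 1$, $x \ge 1$, and all $\varepsilon > 0$ satisfying

$(1 - \alpha)\varepsilon \ge 3 \omega_{P,n}(\mathcal{G},\varepsilon) + \sqrt{\frac{2x (b\varepsilon^\beta + B)^v (w\varepsilon^\vartheta + W)}{n}} + \frac{2x (b\varepsilon^\beta + B)}{n}$

we have

$\Pr^*\bigl( T \in Z^n : \text{for all } g \in \mathcal{G} \text{ with } \mathbb{E}_T g \le \alpha\varepsilon \text{ we have } \mathbb{E}_P g < \varepsilon \bigr) \ge 1 - e^{-x}$.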

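Proof. We define $\mathcal{H} := \{ \mathbb{E}_P g - g : g \in \mathcal{G}, \mathbb{E}_P g = \varepsilon \}$.
Obviously, for all $h \in \mathcal{H}$ we have $\mathbb{E}_P h = 0$ and

$\|h\|_\infty \le 2b\varepsilon^\beta + 2B =: M$,

$\mathbb{E}_P h^2 \le \mathbb{E}_P g^2 \le (b\varepsilon^\beta + B)^v (w\varepsilon^\vartheta + W) =: \tau$.

Moreover, it is also easy to verify that $\mathcal{H}$ is separable with respect to
$\|\cdot\|_\infty$. As in the proof of Lemma 5.4 in [12] our assumption on $\mathcal{G}$
now yields

$\Pr^*\bigl( T \in Z^n : \exists g \in \mathcal{G} \text{ with } \mathbb{E}_T g \le \alpha\varepsilon \text{ and } \mathbb{E}_P g \ge \varepsilon \bigr)$

$\le \Pr^*\bigl( T \in Z^n : \exists g \in \mathcal{G} \text{ with } \mathbb{E}_T g \le \alpha\varepsilon \text{ and } \mathbb{E}_P g = \varepsilon \bigr)$

$= \Pr^*\bigl( T \in Z^n : \exists g \in \mathcal{G} \text{ with } \mathbb{E}_P g - \mathbb{E}_T g \ge (1 - \alpha)\varepsilon \text{ and } \mathbb{E}_P g = \varepsilon \bigr)$

$\le P^n\Bigl( T \in Z^n : \sup_{g \in \mathcal{G},\ \mathbb{E}_P g = \varepsilon} (\mathbb{E}_P g - \mathbb{E}_T g) \ge (1 - \alpha)\varepsilon \Bigr)$

$= P^n\Bigl( T \in Z^n : \sup_{h \in \mathcal{H}} \mathbb{E}_T h \ge (1 - \alpha)\varepsilon \Bigr)$.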
In order to bound the last probability we will apply Theorem 3.2. To this end
observe

$3 \mathbb{E}_{T' \sim P^n} \sup_{h \in \mathcal{H}} \mathbb{E}_{T'} h + \sqrt{\frac{2x\tau}{n}} + \frac{Mx}{n} \le (1 - \alpha)\varepsilon$,

and consequently applying Theorem 3.2 yields

$\Pr^*\bigl( T \in Z^n : \exists g \in \mathcal{G} \text{ with } \mathbb{E}_T g \le \alpha\varepsilon \text{ and } \mathbb{E}_P g \ge \varepsilon \bigr) \le e^{-x}$. □

With the help of the above lemma we can now prove Theorem 3.1.



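Proof of Theorem 3.1. For $\alpha := 0$ we will apply Lemma 3.3 to the class
$\mathcal{G}$. To this end it obviously suffices to show the richness condition on
$\mathcal{G}$ of Lemma 3.3. Let $f \in \mathcal{F}$ satisfy

$\mathbb{E}_T (C \circ f - C \circ f_{P,\mathcal{F}}) \le 0$ and $\mathbb{E}_P (C \circ f - C \circ f_{P,\mathcal{F}}) \ge \varepsilon$.

For $t \in [0,1]$ we define $f_t := tf + (1-t) f_{P,\mathcal{F}}$. Since $\mathcal{F}$ is
convex we have $f_t \in \mathcal{F}$ for all $t \in [0,1]$. By the line-continuity of $C$
and Lebesgue's theorem we find that the map
$h : t \mapsto \mathbb{E}_P (C \circ f_t - C \circ f_{P,\mathcal{F}})$ is continuous for
$t \in [0,1]$. Since $h(0) = 0$ and $h(1) \ge \varepsilon$ there is a $t \in (0,1]$ with

$\mathbb{E}_P (C \circ f_t - C \circ f_{P,\mathcal{F}}) = h(t) = \varepsilon$

by the intermediate value theorem. Moreover, for this $t$ the convexity of $C$ gives

$\mathbb{E}_T (C \circ f_t - C \circ f_{P,\mathcal{F}}) \le \mathbb{E}_T \bigl( t\, C \circ f + (1-t)\, C \circ f_{P,\mathcal{F}} - C \circ f_{P,\mathcal{F}} \bigr) \le 0$.

Now, let $\varepsilon > 0$ satisfy the assumption of the theorem. Then $\varepsilon$ also
satisfies the assumptions of Lemma 3.3 and hence we find that with probability at least
$1 - e^{-x}$ every $f \in \mathcal{F}$ with
$\mathbb{E}_T (C \circ f - C \circ f_{P,\mathcal{F}}) \le 0$ satisfies
$\mathbb{E}_P (C \circ f - C \circ f_{P,\mathcal{F}}) < \varepsilon$. Since we always have

$\mathbb{E}_T (C \circ f_{T,\mathcal{F}} - C \circ f_{P,\mathcal{F}}) \le 0$

we obtain the assertion. □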



4. Proof of the main result 

In order to prove our oracle-type inequality we will apply Theorem 3.1. To this end
we define the regularized cost function $C_\lambda$ by

$C_\lambda(x, y, f) := \lambda \|f\|_H^2 + L(y, f(x))$, $x \in X$, $y \in Y$, $f \in H$,

and the induced cost class

$\mathcal{G}(\lambda) := \{ C_\lambda \circ f - C_\lambda \circ f_{P,\lambda} : f \in \lambda^{-1/2} B_H \}$, $\lambda > 0$.

Obviously, the $C_\lambda$-risk minimizer produces the functions $f_{P,\lambda}$ and
$f_{T,\lambda}$. Note that $\mathcal{R}_{L,P}(0) \le 1$ implies
$f_{P,\lambda} \in \lambda^{-1/2} B_H$ for all distributions $P$ on $X \times Y$, and
hence the latter in particular holds for the empirical solutions $f_{T,\lambda}$.
However, it was already observed in [12] that, depending on the approximation error
function, sharper bounds for $\|f_{T,\lambda}\|$ are possible with high probability. In
order to establish such sharper bounds we employed a "shrinking technique" in [12]
which is rather complicated. The key idea of this paper is to replace the shrinking
technique by a localization argument based on (6). Consequently, let us first show that
regularized risk minimizers always satisfy the supremum bound (6):
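Lemma 4.1. Let $0 < \lambda < 1$, and suppose that $g \in \mathcal{G}(\lambda)$. Then for
any $f \in \lambda^{-1/2} B_H$ such that
$g = C_\lambda \circ f - C_\lambda \circ f_{P,\lambda}$ we have

$\|g\|_\infty \le 3 \Bigl( \frac{\mathbb{E}_P g}{\lambda} \Bigr)^{\alpha/2} + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{\alpha/2} + 2$ and $\lambda \|f\|_H^2 \le a(\lambda) + \mathbb{E}_P g$.

Proof. Let us write $\varepsilon := \mathbb{E}_P g$. Then we have

$\lambda \|f\|_H^2 \le \lambda \|f\|_H^2 + \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}$

$= \lambda \|f_{P,\lambda}\|_H^2 + \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P} + \varepsilon$

$= a(\lambda) + \varepsilon$,

which establishes the second assertion. Consequently,
$\|L \circ f\|_\infty \le 1 + \|f\|_\infty^\alpha$ yields

$\|C_\lambda \circ f\|_\infty \le \lambda \|f\|_H^2 + \|L \circ f\|_\infty \le a(\lambda) + \varepsilon + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{\alpha/2} + \Bigl( \frac{\varepsilon}{\lambda} \Bigr)^{\alpha/2} + 1$.

Analogously, we obtain
$\|C_\lambda \circ f_{P,\lambda}\|_\infty \le a(\lambda) + (a(\lambda)/\lambda)^{\alpha/2} + 1$
and therefore we find

$\|g\|_\infty \le \max\bigl( \|C_\lambda \circ f\|_\infty, \|C_\lambda \circ f_{P,\lambda}\|_\infty \bigr) \le \varepsilon + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{\alpha/2} + \Bigl( \frac{\varepsilon}{\lambda} \Bigr)^{\alpha/2} + 2$,

where in the last step we used $a(\lambda) \le 1$. Now, $f \in \lambda^{-1/2} B_H$ implies
that $\|f\|_\infty \le \lambda^{-1/2}$ and an easy calculation shows that
$2 + \lambda^{-\alpha/2} \le 3\lambda^{-\alpha/2}$. Therefore we obtain

$\varepsilon \le \mathbb{E}_P C_\lambda \circ f = \lambda \|f\|_H^2 + \mathcal{R}_{L,P}(f) \le 2 + \|f\|_\infty^\alpha \le 2 + \lambda^{-\alpha/2} \le 3\lambda^{-\alpha/2}$.

From this we easily obtain
$\varepsilon \le 3^{1 - \alpha/2} (\varepsilon/\lambda)^{\alpha/2} \le 2 (\varepsilon/\lambda)^{\alpha/2}$,
which gives the assertion. □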

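We now prove that a variance bound of the form (5) assumed in Theorem 2.1
implies a variance bound of the form (7) assumed in Theorem 3.1.

Lemma 4.2. Let $P$ be a distribution on $X \times Y$ and suppose that there exist
constants $v \ge 0$, $c \ge 1$, and $\vartheta \in [0,1]$ such that the variance bound
assumption (5) is satisfied for some $0 < \lambda < 1$ and all
$f \in \lambda^{-1/2} B_H$. Then for all $g \in \mathcal{G}(\lambda)$ we have

$\mathbb{E}_P g^2 \le 16c \Bigl( \Bigl( \frac{\mathbb{E}_P g}{\lambda} \Bigr)^{1/2} + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{1/2} + 1 \Bigr)^v \bigl( (\mathbb{E}_P g)^\vartheta + 2a^\vartheta(\lambda) \bigr)$.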

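Proof. We use the shorthand notation $\mathbb{E}$ for $\mathbb{E}_P$. For
$g \in \mathcal{G}(\lambda)$ pick an $f \in \lambda^{-1/2} B_H$ such that
$g = C_\lambda \circ f - C_\lambda \circ f_{P,\lambda}$. Now observe that

$\mathbb{E} g^2 = \mathbb{E} (C_\lambda \circ f - C_\lambda \circ f_{P,\lambda})^2$

$= \mathbb{E} \bigl( \lambda \|f\|^2 - \lambda \|f_{P,\lambda}\|^2 + L \circ f - L \circ f_{P,\lambda} \bigr)^2$

$\le 2 \mathbb{E} \bigl( \lambda \|f\|^2 - \lambda \|f_{P,\lambda}\|^2 \bigr)^2 + 2 \mathbb{E} (L \circ f - L \circ f_{P,\lambda})^2$

$\le 2\lambda^2 \|f\|^4 + 2\lambda^2 \|f_{P,\lambda}\|^4 + 2 \mathbb{E} (L \circ f - L \circ f_{P,\lambda})^2$

$\le 4 \mathbb{E} (L \circ f - L \circ f^*_{L,P})^2 + 4 \mathbb{E} (L \circ f^*_{L,P} - L \circ f_{P,\lambda})^2 + 2\lambda^2 \|f\|^4 + 2\lambda^2 \|f_{P,\lambda}\|^4$.

Denote $C := \max\{ \|f\|_\infty + 1, \|f_{P,\lambda}\|_\infty + 1 \}$. Then the
assumption (5) and $a^\vartheta + b^\vartheta \le 2(a + b)^\vartheta$ for all
$a, b \ge 0$, imply that

$\mathbb{E} (L \circ f - L \circ f^*_{L,P})^2 + \mathbb{E} (L \circ f^*_{L,P} - L \circ f_{P,\lambda})^2 \le 2c\, C^v \bigl( \mathbb{E} (L \circ f - L \circ f^*_{L,P}) + \mathbb{E} (L \circ f_{P,\lambda} - L \circ f^*_{L,P}) \bigr)^\vartheta$.

Since $\lambda^2 \|f\|^4 \le 1$ and $\lambda^2 \|f_{P,\lambda}\|^4 \le 1$ we hence obtain

$\mathbb{E} g^2 \le 8c\, C^v \bigl( \mathbb{E} (L \circ f - L \circ f^*_{L,P}) + \mathbb{E} (L \circ f_{P,\lambda} - L \circ f^*_{L,P}) \bigr)^\vartheta + 2\lambda^2 \|f\|^4 + 2\lambda^2 \|f_{P,\lambda}\|^4$

$\le 8c\, C^v \bigl( \mathbb{E} (L \circ f - L \circ f^*_{L,P}) + \mathbb{E} (L \circ f_{P,\lambda} - L \circ f^*_{L,P}) \bigr)^\vartheta + 4 \bigl( \lambda^2 \|f\|^4 + \lambda^2 \|f_{P,\lambda}\|^4 \bigr)^\vartheta$

$\le 16c\, C^v \bigl( \mathbb{E} (L \circ f - L \circ f^*_{L,P}) + \mathbb{E} (L \circ f_{P,\lambda} - L \circ f^*_{L,P}) + \lambda^2 \|f\|^4 + \lambda^2 \|f_{P,\lambda}\|^4 \bigr)^\vartheta$

$\le 16c\, C^v \bigl( \mathbb{E} g + 2 \mathbb{E} (L \circ f_{P,\lambda} - L \circ f^*_{L,P}) + 2\lambda \|f_{P,\lambda}\|^2 \bigr)^\vartheta$

$\le 16c\, C^v \bigl( (\mathbb{E} g)^\vartheta + 2a^\vartheta(\lambda) \bigr)$.

What is left is to bound $C$ in the right hand side of this inequality. To that end
observe that Lemma 4.1 implies

$\|f\|_\infty \le \|f\|_H \le \Bigl( \frac{a(\lambda) + \mathbb{E} g}{\lambda} \Bigr)^{1/2}$

and

$\|f_{P,\lambda}\|_\infty \le \|f_{P,\lambda}\|_H \le \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{1/2} \le \Bigl( \frac{a(\lambda) + \mathbb{E} g}{\lambda} \Bigr)^{1/2}$

so that we can bound

$C = \max\bigl( \|f\|_\infty + 1, \|f_{P,\lambda}\|_\infty + 1 \bigr) \le \Bigl( \frac{a(\lambda) + \mathbb{E} g}{\lambda} \Bigr)^{1/2} + 1 \le \Bigl( \frac{\mathbb{E} g}{\lambda} \Bigr)^{1/2} + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{1/2} + 1$. □

The following lemma relates the covering numbers of $B_H$ with
$\omega_{P,n}(\mathcal{G}(\lambda), \varepsilon)$: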



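Lemma 4.3. Let $n \in \mathbb{N}$, and assume that there are constants $a \ge 1$ and
$p \in (0,2)$ such that for all $\delta > 0$, we have

$\sup_{T \in Z^n} \log \mathcal{N}(B_H, \delta, L_2(T_X)) \le a \delta^{-p}$.

Then there is a constant $c_{L,p} > 0$ depending only on $L$ and $p$ such that for all
distributions $P$ on $X \times Y$, and all $\lambda \in (0,1]$, $\varepsilon > 0$ we have

$\omega_{P,n}(\mathcal{G}(\lambda), \varepsilon) \le c_{L,p} \max\Bigl\{ \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{\alpha p}{4}} \tau_\varepsilon^{\frac{2-p}{4}} \Bigl( \frac{a}{n} \Bigr)^{\frac{1}{2}},\ \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{\alpha}{2}} \Bigl( \frac{a}{n} \Bigr)^{\frac{2}{2+p}} \Bigr\}$

where $\tau_\varepsilon \ge \sup_{g \in \mathcal{G}_\varepsilon} \mathbb{E}_P g^2$ and
$\mathcal{G}_\varepsilon := \{ g \in \mathcal{G}(\lambda) : \mathbb{E}_P g \le \varepsilon \}$.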

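Proof. Our first goal is to bound the covering numbers of $\mathcal{G}_\varepsilon$. To
this end recall that for
$g := C_\lambda \circ f - C_\lambda \circ f_{P,\lambda} \in \mathcal{G}_\varepsilon$,
Lemma 4.1 shows that $\|f\|_H \le \bigl( \frac{a(\lambda) + \varepsilon}{\lambda} \bigr)^{1/2} =: \Lambda$.
With the help of the auxiliary sets
$\hat{\mathcal{G}}_\varepsilon := \{ C_\lambda \circ f : f \in \Lambda B_H \}$ and
$\mathcal{H} := \{ L \circ f : f \in \Lambda B_H \}$ we thus obtain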

\ogN(g e , 25, L 2 (T)) < logJV(g £ ,25, L 2 {T)) 

<log(^ + l) +logAf(H,S,L 2 (T)) 

<log(jUl) +\ogAf(hB H 6 1 L 2 (T X )) . 

Furthermore, the Lipschitz assumption (1) implies the right hand side is bounded by

log(i + l)+logAA^,-^,L 2 (T x )). 

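Consequently, there is a constant $c_{L,p} > 0$ depending only on $L$ and $p$ such that
for all $\delta > 0$ we have

$\sup_{T \in Z^n} \log \mathcal{N}(\mathcal{G}_\varepsilon, \delta, L_2(T)) \le a\, c_{L,p} \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} \Bigr)^{\frac{\alpha p}{2}} \delta^{-p} \le a\, c_{L,p} \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{\alpha p}{2}} \delta^{-p}$.

By symmetrization, and the proofs of [3, Lem. 2.5] and [12, Prop. 5.7] we thus find

$\omega_{P,n}(\mathcal{G}(\lambda), \varepsilon) \le c_{L,p} \max\Bigl\{ \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{\alpha p}{4}} \tau_\varepsilon^{\frac{2-p}{4}} \Bigl( \frac{a}{n} \Bigr)^{\frac{1}{2}},\ \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{\alpha}{2}} \Bigl( \frac{a}{n} \Bigr)^{\frac{2}{2+p}} \Bigr\}$. □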

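Proof of Theorem 2.1. Let $g := C_\lambda \circ f - C_\lambda \circ f_{P,\lambda}$ for some
$f \in \lambda^{-1/2} B_H$. Lemma 4.1 implies that we have a supremum bound

$\|g\|_\infty \le 3 \Bigl( \frac{\mathbb{E}_P g}{\lambda} \Bigr)^{\alpha/2} + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{\alpha/2} + 2$.

Because of the variance bound assumption (5), Lemma 4.2 implies we have a variance
bound of the form

$\mathbb{E}_P g^2 \le 16c \Bigl( \Bigl( \frac{\mathbb{E}_P g}{\lambda} \Bigr)^{1/2} + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{1/2} + 1 \Bigr)^{v} \bigl( (\mathbb{E}_P g)^\vartheta + 2a^\vartheta(\lambda) \bigr)$

$\le 48c \Bigl( \Bigl( \frac{\mathbb{E}_P g}{\lambda} \Bigr)^{\alpha/2} + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{\alpha/2} + 1 \Bigr)^{v/\alpha} \bigl( (\mathbb{E}_P g)^\vartheta + 2a^\vartheta(\lambda) \bigr)$

$\le \Bigl( 3 \Bigl( \frac{\mathbb{E}_P g}{\lambda} \Bigr)^{\alpha/2} + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{\alpha/2} + 2 \Bigr)^{v/\alpha} \bigl( 48c (\mathbb{E}_P g)^\vartheta + 96c\, a^\vartheta(\lambda) \bigr)$.

Therefore we have variance and supremum bounds of the form (7) and (6) with the
values $b = 3\lambda^{-\alpha/2}$, $\beta = \alpha/2$,
$B = (a(\lambda)/\lambda)^{\alpha/2} + 2$, $w = 48c$, the exponent $v/\alpha$ in place of
$v$, and $W = 96c\, a^\vartheta(\lambda)$.

Denote $\tau_\varepsilon := 3^4 2^6 c \lambda^\vartheta \bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \bigr)^{\frac{v}{2} + \vartheta}$.
Then for $g \in \mathcal{G}(\lambda)$ with $\mathbb{E}_P g \le \varepsilon$ we obtain

$\mathbb{E}_P g^2 \le (b\varepsilon^\beta + B)^{v/\alpha} (w\varepsilon^\vartheta + W)$

$= 48c \Bigl( 3 \Bigl( \frac{\varepsilon}{\lambda} \Bigr)^{\alpha/2} + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{\alpha/2} + 2 \Bigr)^{v/\alpha} \bigl( \varepsilon^\vartheta + 2a^\vartheta(\lambda) \bigr)$

$\le 96 \cdot 9\, c \Bigl( \Bigl( \frac{\varepsilon}{\lambda} \Bigr)^{\alpha/2} + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{\alpha/2} + 1 \Bigr)^{v/\alpha} \bigl( \varepsilon^\vartheta + a^\vartheta(\lambda) \bigr)$

$\le 96 \cdot 9 \cdot 3 \cdot 2\, c \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{v/2} (a(\lambda) + \varepsilon)^\vartheta$

$\le \tau_\varepsilon$.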
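Consequently we can apply Lemma 4.3 to obtain that

$\omega_{P,n}(\mathcal{G}(\lambda), \varepsilon) \le c_{L,p} \max\Bigl\{ \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{\alpha p}{4}} \tau_\varepsilon^{\frac{2-p}{4}} \Bigl( \frac{a}{n} \Bigr)^{\frac{1}{2}},\ \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{\alpha}{2}} \Bigl( \frac{a}{n} \Bigr)^{\frac{2}{2+p}} \Bigr\}$

$\le c_{L,p} \max\Bigl\{ \lambda^{\frac{\vartheta(2-p)}{4}} \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{2\alpha p + (v + 2\vartheta)(2-p)}{8}} \Bigl( \frac{a}{n} \Bigr)^{\frac{1}{2}},\ \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{\alpha}{2}} \Bigl( \frac{a}{n} \Bigr)^{\frac{2}{2+p}} \Bigr\}$.

We also bound the terms

$\sqrt{\frac{2x (b\varepsilon^\beta + B)^{v/\alpha} (w\varepsilon^\vartheta + W)}{n}} \le \sqrt{\frac{2x \tau_\varepsilon}{n}} = \sqrt{\frac{3^4 2^7 c\, x\, \lambda^\vartheta}{n}} \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{v}{4} + \frac{\vartheta}{2}}$

and

$\frac{2x (b\varepsilon^\beta + B)}{n} = \frac{2x}{n} \Bigl( 3 \Bigl( \frac{\varepsilon}{\lambda} \Bigr)^{\alpha/2} + \Bigl( \frac{a(\lambda)}{\lambda} \Bigr)^{\alpha/2} + 2 \Bigr) \le \frac{24x}{n} \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\alpha/2}$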

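and then observe that Theorem 3.1 implies that there is a constant $K \ge 1$ such
that

$\Pr^*\bigl( T \in (X \times Y)^n : \mathcal{R}_{C_\lambda,P}(f_{T,\lambda}) < \mathcal{R}_{C_\lambda,P}(f_{P,\lambda}) + \varepsilon \bigr) \ge 1 - e^{-x}$,

whenever

$\varepsilon \ge K \max\Bigl\{ \lambda^{\frac{\vartheta(2-p)}{4}} \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{2\alpha p + (v + 2\vartheta)(2-p)}{8}} \Bigl( \frac{a}{n} \Bigr)^{\frac{1}{2}},\ \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{\alpha}{2}} \Bigl( \frac{a}{n} \Bigr)^{\frac{2}{2+p}},\ \lambda^{\frac{\vartheta}{2}} \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{v}{4} + \frac{\vartheta}{2}} \Bigl( \frac{x}{n} \Bigr)^{\frac{1}{2}},\ \Bigl( \frac{a(\lambda) + \varepsilon}{\lambda} + 1 \Bigr)^{\frac{\alpha}{2}} \frac{x}{n} \Bigr\}$.

If we further constrain by $\varepsilon \ge a(\lambda) + \lambda$ we find that it is
sufficient to satisfy

$\varepsilon \ge \max\Bigl\{ a(\lambda) + \lambda,\ K \lambda^{\frac{\vartheta(2-p)}{4}} \Bigl( \frac{\varepsilon}{\lambda} \Bigr)^{\frac{2\alpha p + (v + 2\vartheta)(2-p)}{8}} \Bigl( \frac{a}{n} \Bigr)^{\frac{1}{2}},\ K \Bigl( \frac{\varepsilon}{\lambda} \Bigr)^{\frac{\alpha}{2}} \Bigl( \frac{a}{n} \Bigr)^{\frac{2}{2+p}},\ K \lambda^{\frac{\vartheta}{2}} \Bigl( \frac{\varepsilon}{\lambda} \Bigr)^{\frac{v}{4} + \frac{\vartheta}{2}} \Bigl( \frac{x}{n} \Bigr)^{\frac{1}{2}},\ K \Bigl( \frac{\varepsilon}{\lambda} \Bigr)^{\frac{\alpha}{2}} \frac{x}{n} \Bigr\}$.

Since $\vartheta \in (0,1]$ and $v \in [0,2]$ it follows that $0 < v + 2\vartheta \le 4$
which implies that $\frac{2\alpha p + (v + 2\vartheta)(2-p)}{8} \le 1$ and
$\frac{\vartheta}{2} + \frac{v}{4} \le 1$. Consequently we find that it is sufficient to
satisfy

$\varepsilon \ge \max\Bigl\{ a(\lambda) + \lambda,\ \Bigl( \frac{K^2 a}{\lambda^{\frac{2\alpha p + v(2-p)}{4}} n} \Bigr)^{\frac{4}{8 - 2\alpha p - (v + 2\vartheta)(2-p)}},\ \Bigl( \frac{K^{\frac{2+p}{2}} a}{\lambda^{\frac{\alpha(2+p)}{4}} n} \Bigr)^{\frac{4}{(2+p)(2-\alpha)}},\ \Bigl( \frac{K^2 x}{\lambda^{\frac{v}{2}} n} \Bigr)^{\frac{2}{4 - (v + 2\vartheta)}},\ \Bigl( \frac{K x}{\lambda^{\frac{\alpha}{2}} n} \Bigr)^{\frac{2}{2-\alpha}} \Bigr\}$.

Therefore we find that (with a change in the value of the constant $K$) if
$\varepsilon$ satisfies the condition of Theorem 2.1 then

$\mathcal{R}_{L,P}(f_{T,\lambda}) \le \lambda \|f_{T,\lambda}\|_H^2 + \mathcal{R}_{L,P}(f_{T,\lambda}) = \mathcal{R}_{C_\lambda,P}(f_{T,\lambda}) < \mathcal{R}_{C_\lambda,P}(f_{P,\lambda}) + \varepsilon = a(\lambda) + \mathcal{R}^*_{L,P} + \varepsilon$

holds with probability not less than $1 - e^{-x}$. □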



5. Examples 

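Here we perform the analysis mentioned in Examples 2.2 and 2.3. Let us first apply
the oracle inequality to bound
$\mathcal{R}_{L_\alpha,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_\alpha,P}$ with high
probability. To that end we now derive some variance bounds. First observe that
[11, Table 3] shows that the modulus of convexity
$\delta_{\psi_\alpha|[-B,B]}(\varepsilon)$ of the function
$\psi_\alpha : t \mapsto |t|^\alpha$ restricted to the interval $[-B,B]$ satisfies

(8) $\delta_{\psi_\alpha|[-B,B]}(\varepsilon) \ge \frac{\alpha(\alpha-1)}{8} B^{\alpha-2} \varepsilon^2$.

Consequently [2, Lemma 15] implies that the modulus of convexity of
$\mathcal{R}_{L_\alpha,P}$ for functions satisfying $\|f\|_\infty \le B$ is bounded
below by
$\frac{\alpha(\alpha-1)}{8} 2^{\alpha-2} B^{\alpha-2} \varepsilon^2 \ge \frac{\alpha(\alpha-1)}{16} B^{\alpha-2} \varepsilon^2$.
Moreover, the mean value theorem implies that

$\bigl| |t_1 - y|^\alpha - |t_2 - y|^\alpha \bigr| \le \alpha \bigl( \max(|t_1| + 1, |t_2| + 1) \bigr)^{\alpha-1} |t_1 - t_2|$

so that the loss function $f \mapsto L_\alpha(y, f(x))$ has a Lipschitz constant less
than $\alpha (\max\{\|f_1\|_\infty, \|f_2\|_\infty\} + 1)^{\alpha-1}$. Now let

$f^*_{L_\alpha,P} \in \arg\min\{ \mathcal{R}_{L_\alpha,P}(f) \mid f : X \to \mathbb{R} \text{ measurable} \}$

and define $g_f(x,y) := |f(x) - y|^\alpha - |f^*_{L_\alpha,P}(x) - y|^\alpha$. Then the
extension mentioned after the statement of [2, Lemma 14] to non-margin loss functions
implies that we have the variance bound

$\mathbb{E} g_f^2 \le \frac{8\alpha \bigl( \max\{\|f\|_\infty, \|f^*_{L_\alpha,P}\|_\infty\} + 1 \bigr)^{2\alpha-2}}{(\alpha-1) \bigl( \max\{\|f\|_\infty, \|f^*_{L_\alpha,P}\|_\infty\} \bigr)^{\alpha-2}} \mathbb{E} g_f \le \frac{8\alpha}{\alpha-1} \bigl( \max\{\|f\|_\infty, \|f^*_{L_\alpha,P}\|_\infty\} + 1 \bigr)^{\alpha} \mathbb{E} g_f$.

Observe that the right hand side of these bounds goes to $\infty$ as $\alpha \to 1$
since $\psi_1$ is not strictly convex. Also note that such a bound, but with different
constants, follows directly from [11, Equation 28]. Since
$\|f^*_{L_\alpha,P}\|_\infty \le 1$ we then obtain

$\mathbb{E} g_f^2 \le \frac{8\alpha}{\alpha-1} (\|f\|_\infty + 2)^\alpha \mathbb{E} g_f$.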

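Therefore we can apply Theorem 2.1 with $v = \alpha$ and $\vartheta = 1$ to obtain
that there exists a constant $K_\alpha \ge 1$ such that for all $0 < \lambda < 1$,
$\varepsilon > 0$, $x \ge 1$ satisfying

(9) $\varepsilon \ge \max\Bigl\{ a_\alpha(\lambda) + \lambda,\ \Bigl( \frac{K_\alpha a}{\lambda^{\frac{\alpha(2+p)}{4}} n} \Bigr)^{\frac{4}{(2+p)(2-\alpha)}},\ \Bigl( \frac{K_\alpha x}{\lambda^{\frac{\alpha}{2}} n} \Bigr)^{\frac{2}{2-\alpha}} \Bigr\}$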
we have

(10) $\Pr^*\bigl( T \in Z^n : \mathcal{R}_{L_\alpha,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_\alpha,P} \le a_\alpha(\lambda) + \varepsilon \bigr) \ge 1 - e^{-x}$

where $a_\alpha(\cdot)$ is the approximation error function defined with respect to the
risk $\mathcal{R}_{L_\alpha,P}$.

In [13] it was shown that the assumption $f^*_{L_\alpha,P} \in H$ implies that
$a_\alpha(\lambda) \le \lambda \|f^*_{L_\alpha,P}\|_H^2$ for all $\lambda > 0$. We assume
without loss of generality that $a_\alpha(\lambda) \le \lambda$. Let us first consider
when $\alpha = 2$. If we now assume $(\lambda_n)$ is a strictly positive null-sequence
with $\lambda_n^{1+p/2} n \to \infty$ then it is easy to see from the convention (3)
applied to the inequality (9) that our learning rate is of the form $\lambda_n$, thus
finishing the proof for Example 2.2. Now consider the case $1 < \alpha < 2$. Then (9)
becomes

(11) $\varepsilon \ge \max\Bigl\{ a_\alpha(\lambda) + \lambda,\ \lambda^{-\frac{\alpha}{2-\alpha}} \Bigl( \frac{K_\alpha a}{n} \Bigr)^{\frac{4}{(2+p)(2-\alpha)}},\ \lambda^{-\frac{\alpha}{2-\alpha}} \Bigl( \frac{K_\alpha x}{n} \Bigr)^{\frac{2}{2-\alpha}} \Bigr\}$.

Moreover when $n \ge K_\alpha a$ elementary calculations show that it is sufficient to
satisfy

(12) $\varepsilon \ge a_\alpha(\lambda) + \lambda + \lambda^{-\frac{\alpha}{2-\alpha}} x^{\frac{2}{2-\alpha}} \Bigl( \frac{K_\alpha a}{n} \Bigr)^{\frac{4}{(2+p)(2-\alpha)}}$.

If we now assume $\lambda = n^{-\kappa}$, then elementary calculations show that we
obtain the rate $n^{-\kappa}$ independently of the value $\alpha$ when
$\kappa \le \frac{2}{2+p}$ and when $\kappa > \frac{2}{2+p}$ we obtain the rate
$n^{-\frac{2}{2+p} + (\kappa - \frac{2}{2+p}) \frac{\alpha}{2-\alpha}}$.
Let us now assume that the conditional distributions $P(y|x)$ are symmetric. We
now proceed to derive a calibration inequality

$\mathcal{R}_{L_2,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_2,P} \le g\bigl( \mathcal{R}_{L_\alpha,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_\alpha,P} \bigr)$

so that we can apply the bounds on
$\mathcal{R}_{L_\alpha,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_\alpha,P}$ defined by (10) and
(12) to obtain bounds on $\mathcal{R}_{L_2,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_2,P}$ in
terms of $\alpha$. Since we will need results and notations from [11] we first give a
brief outline of its content. Consider a loss function $L$ and a measure $Q$ on $Y$.
Then the associated inner risk is defined as

$\mathcal{C}_{L,Q}(t) = \int L(y,t)\, dQ(y)$, $t \in \mathbb{R}$,

and can be used to compute the risk

$\mathcal{R}_{L,P}(f) = \int_X \mathcal{C}_{L,P(\cdot|x)}(f(x))\, dP_X(x)$.

J x 

The minimal inner risk is defined as $\mathcal{C}^*_{L,Q} := \inf_{t \in \mathbb{R}} \mathcal{C}_{L,Q}(t)$. Consider now another
loss function L. Then the calibration function 5 T ; (e, Q) is defined as the largest 
function comparing the excess inner risks, i.e. 

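We shall also find it convenient to consider the template loss $L_{\mathrm{mean}}$
introduced in [11] and defined by

$L_{\mathrm{mean}}(Q,t) := |\mathbb{E}Q - t|$, $t \in \mathbb{R}$,

and its inner risk

$\mathcal{C}_{L_{\mathrm{mean}},Q}(t) = \int |\mathbb{E}Q - t|\, dQ(y)$, $t \in \mathbb{R}$.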
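We can now proceed to derive the appropriate calibration function $\delta$ for
comparing $L_2$ and $L_\alpha$. Since $P(y|x)$ is symmetric for all $x$,
[11, Theorem 3.23] implies that we have mean calibration with calibration function
bounded below by

$\delta_{\max,L_{\mathrm{mean}},L_\alpha}(\varepsilon, Q) \ge \delta_{\psi_\alpha|[-(2+\varepsilon),2+\varepsilon]}(2\varepsilon)$

where $\delta_{\psi_\alpha|[-(2+\varepsilon),2+\varepsilon]}$ is the modulus of convexity
of the function $\psi_\alpha$ restricted to the interval
$[-(2+\varepsilon), 2+\varepsilon]$. By (8) we then obtain

$\delta_{\max,L_{\mathrm{mean}},L_\alpha}(\varepsilon, Q) \ge \frac{\alpha(\alpha-1)}{2} (2 + \varepsilon)^{\alpha-2} \varepsilon^2$.

Since [11, Equation (38)] states
$\delta_{\max,L_2,L_\alpha}(\varepsilon, Q) = \delta_{\max,L_{\mathrm{mean}},L_\alpha}(\sqrt{\varepsilon}, Q)$
we find

$\delta_{\max,L_2,L_\alpha}(\varepsilon, Q) \ge \frac{\alpha(\alpha-1)}{2} (2 + \sqrt{\varepsilon})^{\alpha-2} \varepsilon$.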

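We now seek to apply [11, Theorem 2.13]. In that notation we bound

$B_f = \sup_x |f(x) - \mathbb{E}(y|x)|^2 \le (\|f\|_\infty + 1)^2$.

Denote $\varphi(\varepsilon) := \frac{\alpha(\alpha-1)}{2} (2 + \sqrt{\varepsilon})^{\alpha-2} \varepsilon$.
Then since

$\frac{d}{d\varepsilon} \bigl( (2 + \sqrt{\varepsilon})^{\alpha-2} \varepsilon \bigr) = (2 + \sqrt{\varepsilon})^{\alpha-3} \Bigl( 2 + \frac{\alpha}{2} \sqrt{\varepsilon} \Bigr) > 0$

and

$\frac{d^2}{d\varepsilon^2} \bigl( (2 + \sqrt{\varepsilon})^{\alpha-2} \varepsilon \bigr) = (\alpha - 2)\, \varepsilon^{-\frac{1}{2}} \Bigl( \frac{3}{2} + \frac{\alpha}{4} \sqrt{\varepsilon} \Bigr) (2 + \sqrt{\varepsilon})^{\alpha-4} < 0$

we conclude that $\varphi$ is strictly monotonically increasing and concave. It follows
that

$\varphi^{**}_{B_f}(\varepsilon) \ge \varphi^{**}_{(\|f\|_\infty + 1)^2}(\varepsilon) = \frac{\varphi\bigl( (\|f\|_\infty + 1)^2 \bigr)}{(\|f\|_\infty + 1)^2}\, \varepsilon = \frac{\alpha(\alpha-1)}{2} (3 + \|f\|_\infty)^{\alpha-2} \varepsilon$

where $**$ denotes the Fenchel-Legendre bi-conjugate operation (see e.g. [10]). It then
follows from [11, Theorem 2.13] that

(13) $\mathcal{R}_{L_2,P}(f) - \mathcal{R}^*_{L_2,P} \le \frac{2}{\alpha(\alpha-1)} (3 + \|f\|_\infty)^{2-\alpha} \bigl( \mathcal{R}_{L_\alpha,P}(f) - \mathcal{R}^*_{L_\alpha,P} \bigr)$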

for all bounded measurable functions $f$. Note that the constant in this inequality
goes to $\infty$ as $\alpha$ goes to 1. The deeper reason for this behaviour is that
$\psi_\alpha$ is strictly convex when $\alpha > 1$ but not strictly convex when
$\alpha = 1$ as discussed in [11].

We conclude from inequalities (13) and (10) that whenever (12) is satisfied, with
probability greater than $1 - e^{-x}$ we have

$\mathcal{R}_{L_2,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_2,P} \le \frac{4}{\alpha(\alpha-1)} (3 + \|f_{T,\lambda}\|_\infty)^{2-\alpha} \varepsilon$.

However we also know from the last line of the proof of Theorem 2.1 that whenever
(12) is satisfied, with probability greater than $1 - e^{-x}$ we have
$\|f_{T,\lambda}\|_\infty \le \|f_{T,\lambda}\|_H \le (2\varepsilon/\lambda)^{1/2}$.
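Now since $1 \le \frac{\varepsilon}{\lambda}$ when (12) is satisfied it follows that

$(3 + \|f_{T,\lambda}\|_\infty)^{2-\alpha} \le \Bigl( 3 + \sqrt{\frac{2\varepsilon}{\lambda}} \Bigr)^{2-\alpha} \le (3 + \sqrt{2}) \Bigl( \frac{\varepsilon}{\lambda} \Bigr)^{1 - \frac{\alpha}{2}}$

so that with probability greater than $1 - 2e^{-x}$ we have

$\mathcal{R}_{L_2,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_2,P} \le \frac{4}{\alpha(\alpha-1)} (3 + \sqrt{2})\, \lambda^{\frac{\alpha}{2} - 1} \varepsilon^{2 - \frac{\alpha}{2}}$.

If we now apply the inequality $a_\alpha(\lambda) \le \lambda$ and let

$\varepsilon := 2\lambda + \lambda^{-\frac{\alpha}{2-\alpha}} x^{\frac{2}{2-\alpha}} \Bigl( \frac{K_\alpha a}{n} \Bigr)^{\frac{4}{(2+p)(2-\alpha)}}$,

then we see that with probability greater than $1 - 2e^{-x}$ we have

$\mathcal{R}_{L_2,P}(f_{T,\lambda}) - \mathcal{R}^*_{L_2,P}$

$\le \frac{4}{\alpha(\alpha-1)} (3 + \sqrt{2})\, \lambda^{\frac{\alpha}{2} - 1} \Bigl( 2\lambda + \lambda^{-\frac{\alpha}{2-\alpha}} x^{\frac{2}{2-\alpha}} \Bigl( \frac{K_\alpha a}{n} \Bigr)^{\frac{4}{(2+p)(2-\alpha)}} \Bigr)^{2 - \frac{\alpha}{2}}$

$\le c_\alpha \Bigl( \lambda + \lambda^{-\frac{2}{2-\alpha}} x^{\frac{4-\alpha}{2-\alpha}} \Bigl( \frac{K_\alpha a}{n} \Bigr)^{\frac{2(4-\alpha)}{(2+p)(2-\alpha)}} \Bigr)$

for some constant $c_\alpha$ which depends only on $\alpha$.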

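Now let us consider the case when $\lambda = n^{-\kappa}$. Then disregarding the
constants the right-hand side becomes

$n^{-\kappa} + n^{(\kappa - \frac{2}{2+p}) \frac{2}{2-\alpha} - \frac{2}{2+p}}$

so that we obtain performance bounds of the form $n^{-\rho}$ with

$\rho = \min\Bigl\{ \kappa,\ \frac{2}{2+p} + \Bigl( \frac{2}{2+p} - \kappa \Bigr) \frac{2}{2-\alpha} \Bigr\}$.

Simple calculations show that when $\kappa \le \frac{2}{2+p}$ then $\rho = \kappa$
independently of the value of $\alpha$ and when $\kappa > \frac{2}{2+p}$ then
$\rho = \frac{2}{2+p} + \bigl( \frac{2}{2+p} - \kappa \bigr) \frac{2}{2-\alpha}$. In the
latter case it is important to observe that the rates get worse as $\alpha$ increases
towards 2. Indeed one can show that $\rho < 0$ in the interval

$2 - \Bigl( \kappa - \frac{2}{2+p} \Bigr)(2 + p) < \alpha < 2$.

Moreover one can see that smaller $\alpha$ minimizes the sensitivity to the degree to
which $\kappa$ is greater than $\frac{2}{2+p}$.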
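The trade-off between $\kappa$ and $\alpha$ is easy to explore numerically. The following sketch evaluates the exponent $\rho$ of the bound $n^{-\rho}$, assuming the formula $\rho = \min\{\kappa,\ \frac{2}{2+p} + (\frac{2}{2+p} - \kappa)\frac{2}{2-\alpha}\}$ as we read it from the display above; the function name `rate_exponent` is ours.

```python
def rate_exponent(kappa, p, alpha):
    """Exponent rho of the L2-risk bound n**(-rho) for lambda = n**(-kappa).

    Assumes 1 < alpha < 2, 0 < p < 2 and kappa > 0. A negative return value
    means the bound gives no rate for this combination.
    """
    if not (1 < alpha < 2 and 0 < p < 2 and kappa > 0):
        raise ValueError("expects 1 < alpha < 2, 0 < p < 2, kappa > 0")
    k_star = 2.0 / (2.0 + p)  # the kappa-optimal choice
    return min(kappa, k_star + (k_star - kappa) * 2.0 / (2.0 - alpha))
```

For example, with $p = 1$ and $\kappa = 0.8$ (slightly above $\frac{2}{3}$), the exponent stays positive for $\alpha = 1.5$ but is negative for $\alpha = 1.9$, in line with the interval $2 - (\kappa - \frac{2}{2+p})(2+p) < \alpha < 2$, which here starts at $\alpha = 1.6$.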